In [2]:
install.packages("tidyr")
install.packages("dplyr")
install.packages("ggplot2")
install.packages("gtable")
install.packages("gamlss")
install.packages("mclust")
install.packages("igraph")
install.packages("devtools")
install.packages("conflicted")
install.packages("scico") # provides scale_color_scico(), loaded via library(scico)
Mutations in the genome can alter the sequence so that non-coding sections become coding, e.g. through the introduction of a start codon or the alteration of a stop codon. To identify the resulting non-canonical genomic products, protein databases that capture genetic variation and non-canonical genomic products are generated, either by enriching canonical protein sequences or by six-reading-frame translation of the entire genome (1).
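To make the idea of six-reading-frame translation concrete, the following is a minimal base R sketch on a toy sequence. It is not the procedure used in (1) or (2), and real databases are built with dedicated tools. The function names translateFrame, reverseComplement, and sixFrameTranslation are hypothetical helpers introduced only for this illustration. The sketch assumes the standard genetic code and marks codons containing ambiguous bases with "X".

```r
# Standard genetic code, with codons ordered over the bases T, C, A, G
# (first base varies slowest, third base varies fastest).
bases <- c("T", "C", "A", "G")
codonGrid <- expand.grid(
    third = bases,
    second = bases,
    first = bases,
    stringsAsFactors = FALSE
)
codons <- paste0(codonGrid$first, codonGrid$second, codonGrid$third)
aminoAcids <- strsplit(
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    split = ""
)[[1]]
codonTable <- setNames(aminoAcids, codons)

# Hypothetical helper: translate one strand starting at offset 0, 1, or 2.
translateFrame <- function(dna, offset) {
    firstStart <- 1 + offset
    lastStart <- nchar(dna) - 2
    if (firstStart > lastStart) {
        return("")
    }
    starts <- seq(from = firstStart, to = lastStart, by = 3)
    frameCodons <- substring(dna, first = starts, last = starts + 2)
    residues <- unname(codonTable[frameCodons])
    residues[is.na(residues)] <- "X" # codons with ambiguous bases
    paste(residues, collapse = "")
}

# Hypothetical helper: reverse complement of a DNA string.
reverseComplement <- function(dna) {
    complemented <- chartr("ACGT", "TGCA", dna)
    paste(rev(strsplit(complemented, split = "")[[1]]), collapse = "")
}

# Hypothetical helper: three frames on each strand, six frames in total.
sixFrameTranslation <- function(dna) {
    dna <- toupper(dna)
    strands <- c(forward = dna, reverse = reverseComplement(dna))
    frames <- character(0)
    for (strandName in names(strands)) {
        for (offset in 0:2) {
            frameName <- paste0(strandName, "_", offset + 1)
            frames[frameName] <- translateFrame(strands[[strandName]], offset)
        }
    }
    frames
}

exampleFrames <- sixFrameTranslation("ATGGCCTAA")
print(exampleFrames)
```

Stop codons are written as "*", so the stretches between two "*" in any frame are the candidate sequences that a six-frame database would contain.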
In [2]:
library(tidyr)
library(dplyr)
library(ggplot2)
library(scico)
theme_set(theme_bw(base_size = 11))
In this tutorial, we will analyze the non-canonical genomic products identified in breast cancer by Johansson et al. (2). Note that this tutorial does not cover database generation, search, or the validation of identification results. These bioinformatic procedures are very demanding, and we strongly advise making sure that they are in place at your lab or at the facility processing the data before conducting any proteogenomic experiment. The proteogenomic identification results by Johansson et al. (2) are reported in Supplementary Data 6, available in the course repository.
For this tutorial, the Novel Peptides table was extracted to an R-friendly text format, and is available in resources/data/novel_peptides.gz.
In [3]:
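# Load the novel peptides table (tab-separated, gzip-compressed).
# Quote and comment parsing are disabled so that fields are read verbatim.
novelPeptidesDF <- read.table(
    file = "resources/data/novel_peptides.gz",
    header = TRUE,
    sep = "\t",
    comment.char = "",
    quote = "",
    stringsAsFactors = FALSE
)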
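The comment.char and quote arguments in the call above matter because read.table by default treats "#" as the start of a comment and both single and double quotes as quoting characters, which can truncate or merge fields that contain these characters. The sketch below illustrates this on a small round trip through a temporary gzip-compressed file. The toy table, its column names, and its values are made up for the illustration and are not taken from the course data.

```r
# Toy table (hypothetical values) with a field containing "'" and "#".
toyDF <- data.frame(
    peptide = c("PEPTIDEA", "PEPTIDEB"),
    note = c("5'UTR #1", "plain text"),
    stringsAsFactors = FALSE
)

# Write the table as gzip-compressed, tab-separated text.
toyPath <- tempfile(fileext = ".gz")
toyConnection <- gzfile(toyPath, open = "wt")
write.table(
    toyDF,
    file = toyConnection,
    sep = "\t",
    quote = FALSE,
    row.names = FALSE
)
close(toyConnection)

# Read it back with the same settings as in the tutorial:
# read.table decompresses gzip files transparently.
roundTripDF <- read.table(
    file = toyPath,
    header = TRUE,
    sep = "\t",
    comment.char = "",
    quote = "",
    stringsAsFactors = FALSE
)

print(roundTripDF)
```

With these settings, the note column should come back exactly as it was written.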
In [5]:
# Count the peptides in each class, most frequent first.
classesDF <- as.data.frame(
    table(
        novelPeptidesDF$class
    )
) %>%
    rename(
        class = Var1,
        n_peptides = Freq
    ) %>%
    arrange(
        desc(n_peptides)
    )

print(classesDF)
In [9]:
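# Keep the peptides that are not intergenic, with their nearest gene and category.
geneDF <- novelPeptidesDF %>%
    filter(
        class != "intergenic"
    ) %>%
    select(
        nearest_gene, category
    )

# Count the peptides in each category, most frequent first.
categoriesDF <- as.data.frame(
    table(
        geneDF$category
    )
) %>%
    rename(
        category = Var1,
        n_peptides = Freq
    ) %>%
    arrange(
        desc(n_peptides)
    )

print(categoriesDF)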
In Supplementary Table 8, the authors provide the abundance for novel peptides monitored in normal tissue and tumors for five patients. The table was extracted to an R-friendly text format for this tutorial, and is available in resources/data/novel_peptides_paired.gz.
In [10]:
# Load the paired abundances and reshape them to one row per
# (peptide, control value, tumor value) combination.
# Note: gathering controls and tumors in two separate steps combines
# every control value of a peptide with every tumor value of that peptide.
novelPeptidesDF <- read.table(
    file = "resources/data/novel_peptides_paired.gz",
    header = TRUE,
    sep = "\t",
    comment.char = "",
    quote = "",
    stringsAsFactors = FALSE
) %>%
    gather(
        "Control_1", "Control_2", "Control_3", "Control_4", "Control_5",
        key = "control_id",
        value = "control"
    ) %>%
    select(
        -control_id
    ) %>%
    gather(
        "LumA_1", "Her2_2", "LumB_3", "Basal_4", "Her2_5",
        key = "tumor_id",
        value = "tumor"
    ) %>%
    separate(
        col = "tumor_id",
        into = c("tumorType", "patientNumber"),
        sep = "_"
    ) %>%
    mutate(
        patientId = paste("Patient", patientNumber)
    ) %>%
    arrange(
        abs(tumor - control)
    )
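Because the reshaping above gathers the control columns and the tumor columns in two successive steps and drops control_id, each control value ends up next to the tumor values of all five patients, that is, 25 combinations per peptide. If the comparison should instead be restricted to the control and tumor of the same patient, one possible approach is to reshape the two sets of columns separately and join them on the patient number. The sketch below shows this on a toy table with two patients. The peptide column and all values are hypothetical, and the real table may need additional identifier columns in the join.

```r
library(tidyr)
library(dplyr)

# Toy table (hypothetical values): two peptides, two patients.
toyDF <- data.frame(
    peptide = c("pepA", "pepB"),
    Control_1 = c(10, 20),
    Control_2 = c(11, 21),
    LumA_1 = c(100, 200),
    Her2_2 = c(110, 210),
    stringsAsFactors = FALSE
)

# Long format for the controls, keeping the patient number.
controlDF <- toyDF %>%
    select(
        peptide, Control_1, Control_2
    ) %>%
    gather(
        "Control_1", "Control_2",
        key = "control_id",
        value = "control"
    ) %>%
    separate(
        col = "control_id",
        into = c("sampleType", "patientNumber"),
        sep = "_"
    ) %>%
    select(
        -sampleType
    )

# Long format for the tumors, keeping tumor type and patient number.
tumorDF <- toyDF %>%
    select(
        peptide, LumA_1, Her2_2
    ) %>%
    gather(
        "LumA_1", "Her2_2",
        key = "tumor_id",
        value = "tumor"
    ) %>%
    separate(
        col = "tumor_id",
        into = c("tumorType", "patientNumber"),
        sep = "_"
    )

# One row per peptide and patient, with matched control and tumor values.
pairedDF <- inner_join(
    controlDF,
    tumorDF,
    by = c("peptide", "patientNumber")
)

print(pairedDF)
```

In this toy example the join yields one row per peptide and patient, whereas the two-step gather would yield one row for every control and tumor combination.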
In [11]:
# Tumor vs. control intensity, one panel per tumor type.
# Dotted lines mark the 20th and 80th percentiles of each axis.
ggplot(
    data = novelPeptidesDF
) +
    geom_hline(
        yintercept = quantile(
            x = novelPeptidesDF$tumor,
            probs = c(0.2, 0.8),
            na.rm = TRUE
        ),
        col = "black",
        linetype = "dotted"
    ) +
    geom_vline(
        xintercept = quantile(
            x = novelPeptidesDF$control,
            probs = c(0.2, 0.8),
            na.rm = TRUE
        ),
        col = "black",
        linetype = "dotted"
    ) +
    geom_point(
        mapping = aes(
            x = control,
            y = tumor,
            col = log10(ms1_area)
        )
    ) +
    facet_grid(
        tumorType ~ .
    ) +
    scale_x_log10(
        name = "Intensity in control tissue"
    ) +
    scale_y_log10(
        name = "Intensity in tumor tissue"
    ) +
    scale_color_scico(
        name = "MS1 Area [log10]",
        palette = "batlow"
    ) +
    theme(
        legend.position = "top",
        panel.grid = element_blank()
    )